New Language Model Adaptation Algorithm based on the Definition of Cardinal Distributions of the Words

Authors

  • D. Janiszek
  • R. De Mori
  • F. Bechet
  • D. Matrouf
Abstract

Linear transformations are proposed for transforming vectors of Language Model (LM) probabilities. A separate vector is considered for each word, and the j-th element of a vector is the probability of observing the word in the context of its j-th history. If a good general LM is available, it is possible to cluster the vectors into classes and to infer a transformation for each class. Probability distributions of words that are unobserved, or observed only with low frequency, in the adaptation corpus can then be obtained by taking the distribution they have in the general model and applying the transformation of the cluster they belong to. Experimental results show that there is an interesting range of adaptation-corpus sizes in which the perplexity of the adapted LM is lower than the perplexity of the LM whose probabilities are estimated directly from the adaptation data.
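The idea in the abstract can be illustrated with a minimal sketch. The abstract does not say how the per-class transformations are inferred, so the least-squares fit below is an assumption, as are the function name `adapt_lm`, the matrix layout (rows are words, columns are histories), and the premise that cluster labels are already available. It is not the authors' algorithm, only one plausible reading of it.

```python
import numpy as np

def adapt_lm(G, D, seen, labels):
    """Hypothetical sketch of cluster-wise linear LM adaptation.

    G      : (V, H) general model, G[w, j] = P_general(w | history j)
    D      : (V, H) direct estimates from the adaptation corpus
    seen   : (V,) bool, True for words observed often enough in the corpus
    labels : (V,) int cluster id per word (assumed given, e.g. from
             clustering the rows of G)
    """
    # Seen words keep their direct estimates; unseen words start from G.
    adapted = np.where(seen[:, None], D, G).astype(float)
    for c in np.unique(labels):
        in_c = labels == c
        train = in_c & seen      # words used to infer the transformation
        target = in_c & ~seen    # words whose vectors get transformed
        if not train.any() or not target.any():
            continue  # nothing to learn from, or nothing to transform
        # Assumption: fit A_c by least squares so that G[train] @ A_c ≈ D[train].
        A_c, *_ = np.linalg.lstsq(G[train], D[train], rcond=None)
        # Clip so that no transformed probability is zero or negative.
        adapted[target] = np.clip(G[target] @ A_c, 1e-12, None)
    # Each column is a distribution over words for one history: renormalize.
    return adapted / adapted.sum(axis=0, keepdims=True)
```

Note that a word's vector runs across histories and so does not itself sum to one; the final step renormalizes each column (one history) over the vocabulary instead.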

Similar resources

A new model for Persian multi-part words edition based on statistical machine translation

Multi-part words in English are hyphenated, with the hyphen separating the different parts. Persian has multi-part words as well. Under Persian morphology, a half-space character is needed to separate the parts of a multi-part word, but in many cases people incorrectly use a space character instead. This common incorrect use of the space leads to some s...

A probabilistic topic model based on local word relationships in overlapping windows

A probabilistic topic model assumes that documents are generated through a process involving topics and then tries to reverse this process: given the documents, it extracts the topics. A topic is usually assumed to be a distribution over words. LDA is one of the first and most popular topic models introduced so far. In the document generation process assumed by LDA, each document is a distribution o...

Haplotype Block Partitioning and tagSNP Selection under the Perfect Phylogeny Model

Single Nucleotide Polymorphisms (SNPs) are the most usual form of polymorphism in the human genome. Analyses of genetic variations have revealed that individual genomes share common SNP-haplotypes. The particular pattern of these common variations forms a block-like structure on the human genome. In this work, we develop a new method based on the Perfect Phylogeny Model to identify haplo...

Online Persian handwriting recognition using a language model and reduction of the user's writing rules

The joined-up, cursive form of Persian words, the immense variety of its scripts, and the different shapes Persian letters take depending on their position in a word make Persian handwriting recognition a serious challenge. The major weakness of most recognition methods is that they ignore sentence context, which causes the use of a word with correct appea...

On Bivariate Generalized Exponential-Power Series Class of Distributions

In this paper, we introduce a new class of bivariate distributions by compounding the bivariate generalized exponential and power-series distributions. This new class contains the bivariate generalized exponential-Poisson, bivariate generalized exponential-logarithmic, bivariate generalized exponential-binomial and bivariate generalized exponential-negative binomial distributions as specia...

Publication date: 2000